The following material has been adapted from the 2023 Swiss Institute of Bioinformatics single cell RNA sequencing course (https://sib-swiss.github.io/single-cell-training/) authored by Tania Wyss, Rachel Marcone-Jeitziner, Geert van Geest, and Patricia Palagi

library(Seurat)
library(ggplot2)
library(dplyr)

Load the Seurat object if you need to

load("seu_preprocess_dimreduction.RData")

Build the graph used for clustering

The FindNeighbors function adds a graph to the Seurat object that is used for clustering (FindClusters used to do this automatically but not any more). Key arguments are dims (the number of PCs) and k.param (the number of neighbours) - default is 20. Default is usually fine but sometimes it helps clustering to have the same number of neighbours as in the RunUMAP function (default for RunUMAP is 30). If clusters are mixed together on the UMAP plot then matching n.neighbors in RunUMAP and k.param in FindNeighbors can sometimes help

seu <- Seurat::FindNeighbors(seu, dims = 1:25)
## Computing nearest neighbor graph
## Computing SNN
Inspect the Seurat object to see what has changed

Next we will run FindClusters

There are many important parameters for FindClusters, particularly the resolution and algorithm and method. The default algorithm in Seurat is the Louvain algorithm. The Leiden algorithm is an improvement on Louvain but it does require the installation of Python and the leidenalg package (see Appendix). We will use Louvain today but it is worth looking into setting up Leiden clustering

# Run FindClusters at multiple resolutions
seu <- Seurat::FindClusters(seu, resolution = seq(0.1, 0.8, by=0.1))
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9788
## Number of communities: 7
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9663
## Number of communities: 12
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9571
## Number of communities: 14
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9491
## Number of communities: 16
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9413
## Number of communities: 17
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9341
## Number of communities: 18
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9267
## Number of communities: 19
## Elapsed time: 0 seconds
## Modularity Optimizer version 1.3.0 by Ludo Waltman and Nees Jan van Eck
## 
## Number of nodes: 6830
## Number of edges: 270007
## 
## Running Louvain algorithm...
## Maximum modularity in 10 random starts: 0.9196
## Number of communities: 21
## Elapsed time: 0 seconds

Now let’s examine the metadata

head(seu[[]])
##                            orig.ident nCount_RNA nFeature_RNA percent.mito
## PBMMC-1_AAACCTGCAGACGCAA-1    PBMMC-1       2401          909     2.540608
## PBMMC-1_AAACCTGTCATCACCC-1    PBMMC-1       3532          760     5.181200
## PBMMC-1_AAAGATGCATAAAGGT-1    PBMMC-1       3972         1215     4.934542
## PBMMC-1_AAAGCAAAGCAGCGTA-1    PBMMC-1       3569          894     3.250210
## PBMMC-1_AAAGCAACAATAACGA-1    PBMMC-1       2982          730     3.688799
## PBMMC-1_AAAGCAACATCAGTCA-1    PBMMC-1      22284         3108     3.181655
##                            percent.ribo percent.globin   ident   sum detected
## PBMMC-1_AAACCTGCAGACGCAA-1     28.65473      0.1665973 PBMMC-1  2401      909
## PBMMC-1_AAACCTGTCATCACCC-1     55.03964      0.1981880 PBMMC-1  3532      760
## PBMMC-1_AAAGATGCATAAAGGT-1     30.43807      0.3776435 PBMMC-1  3972     1215
## PBMMC-1_AAAGCAAAGCAGCGTA-1     55.02942      0.3642477 PBMMC-1  3569      894
## PBMMC-1_AAAGCAACAATAACGA-1     54.49363      0.1006036 PBMMC-1  2982      730
## PBMMC-1_AAAGCAACATCAGTCA-1     23.40693     36.9682283 PBMMC-1 22284     3108
##                            percent.top_50 percent.top_100 percent.top_200
## PBMMC-1_AAACCTGCAGACGCAA-1       43.27364        56.35152        68.05498
## PBMMC-1_AAACCTGTCATCACCC-1       60.64553        75.79275        83.32390
## PBMMC-1_AAAGATGCATAAAGGT-1       41.69184        56.92346        69.36052
## PBMMC-1_AAAGCAAAGCAGCGTA-1       51.05071        68.75876        78.53741
## PBMMC-1_AAAGCAACAATAACGA-1       53.75587        71.73038        81.52247
## PBMMC-1_AAAGCAACATCAGTCA-1       58.06408        69.09891        76.53025
##                            percent.top_500 total    S.Score  G2M.Score Phase
## PBMMC-1_AAACCTGCAGACGCAA-1        82.96543  2401 -0.2630557 -0.5250315    G1
## PBMMC-1_AAACCTGTCATCACCC-1        92.63873  3532 -0.4838513 -0.7964061    G1
## PBMMC-1_AAAGATGCATAAAGGT-1        81.99899  3972 -0.6049204 -1.0335645    G1
## PBMMC-1_AAAGCAAAGCAGCGTA-1        88.96049  3569 -0.5251903 -0.8884195    G1
## PBMMC-1_AAAGCAACAATAACGA-1        92.28706  2982 -0.4301263 -0.6948718    G1
## PBMMC-1_AAAGCAACATCAGTCA-1        83.75067 22284 -0.3034593 -2.3294031    G1
##                            RNA_snn_res.0.1 RNA_snn_res.0.2 RNA_snn_res.0.3
## PBMMC-1_AAACCTGCAGACGCAA-1               3               9               9
## PBMMC-1_AAACCTGTCATCACCC-1               1               3               1
## PBMMC-1_AAAGATGCATAAAGGT-1               2               2               7
## PBMMC-1_AAAGCAAAGCAGCGTA-1               1               3               1
## PBMMC-1_AAAGCAACAATAACGA-1               1               3               1
## PBMMC-1_AAAGCAACATCAGTCA-1               4               4               2
##                            RNA_snn_res.0.4 RNA_snn_res.0.5 RNA_snn_res.0.6
## PBMMC-1_AAACCTGCAGACGCAA-1               9               9               9
## PBMMC-1_AAACCTGTCATCACCC-1              12              13              13
## PBMMC-1_AAAGATGCATAAAGGT-1               7               6              15
## PBMMC-1_AAAGCAAAGCAGCGTA-1              12              13              13
## PBMMC-1_AAAGCAACAATAACGA-1              12              13              13
## PBMMC-1_AAAGCAACATCAGTCA-1               1               7               6
##                            RNA_snn_res.0.7 RNA_snn_res.0.8 seurat_clusters
## PBMMC-1_AAACCTGCAGACGCAA-1               9              15              15
## PBMMC-1_AAACCTGTCATCACCC-1              13              13              13
## PBMMC-1_AAAGATGCATAAAGGT-1              16              17              17
## PBMMC-1_AAAGCAAAGCAGCGTA-1              13              13              13
## PBMMC-1_AAAGCAACAATAACGA-1              13              13              13
## PBMMC-1_AAAGCAACATCAGTCA-1               5               5               5

Notes on the clusters

The clusters are named based on the assay name (RNA), the graph used (snn), and the resolution that the clustering was run at (res.0.1-res.0.8). The default cluster is “seurat_clusters” and is the most recently added clustering. In this case we did clustering from 0.1 to 0.8 so the last cluster was at resolution 0.8. After clustering, the active.ident is now seurat_clusters. Let’s have a look at the different clusters on some UMAPs. I am only plotting them 4 at a time so that they are easier to see. Note: Clusters get progressively smaller so cluster 0 is always the biggest

# An easy way of seeing the size of each cluster
summary(seu$seurat_clusters)
##   0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18  19 
## 844 633 582 563 523 477 440 417 412 298 255 245 228 175 161 137 135 112  71  69 
##  20 
##  53
# First four resolutions
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.1","RNA_snn_res.0.2",
                                  "RNA_snn_res.0.3","RNA_snn_res.0.4"),ncol=2)

# optionally save the UMAP plots using ggsave
#ggsave("find_clusters_res0.1to0.4_umap.png",width=20,height=20,units="cm")

# Last four resolutions but this time let's put the labels on the clusters
# and let's remove that annoying legend
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.5","RNA_snn_res.0.6",
                                  "RNA_snn_res.0.7","RNA_snn_res.0.8"),
                ncol=2, label=T, combine=T) & NoLegend()

Clustering comments

The lowest clustering resolution seems to under cluster as it just identifies the major groups on the UMAP. The clusters don’t change much after about resolution 0.4.

Are the clusters sample specific?

On the earlier UMAPs, some of you may have noticed that there was separation based on sample. In some cases (eg disease vs control) this might be expected but these are three control samples so we would not expect there to be big differences

DimPlot(seu, reduction = "umap",group.by = "orig.ident")

We know that there were more erythroid cells in PBMMC-2, but there is separation in other areas of the UMAP as well. Particularly at the bottom of the UMAP where there was high expression of the T cell genes CD3D and CD3E. Let’s use a stacked barplot to look at the proportion of cells within each cluster that come from each sample

# stacked barplot
ggplot(seu[[]], aes(fill=orig.ident,x=RNA_snn_res.0.3))+
  geom_bar(position="fill") +
  theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0.5)) + 
  labs(fill="Sample")

#ggsave("louvain_samples_percluster_res0.3.png",width=15,height=10,units="cm")

# umap of the same cluster resolution
DimPlot(seu, reduction = "umap",group.by = "RNA_snn_res.0.3",label = T)

#ggsave("umap_louvain_res0.3.png",width=15,height=10,units="cm")
What can we see and what can we do about it?

Focusing on the bottom group of cells; Cluster 2 is mostly sample 3, cluster 4 is mostly sample 2, and cluster 8 is mostly sample 1. To try to get the samples to cluster together we can try integration

Integration/batch correction

There are many batch correction/integration methods available including some that are built into Seurat. As a demonstration of using an external package, we are going to use Harmony instead. Harmony is a fast and widely used (over 1000 citations) method for batch correction/integration. It requires you to have run the preprocessing on a combined object of all samples (as we have done here already) rather than for each sample one at a time. For further information you can refer to https://portals.broadinstitute.org/harmony/SeuratV3.html. For information about the Seurat integration methods, refer to the Seurat website

# Install harmony if you haven't already - install.packages("harmony")
library(harmony)
## Loading required package: Rcpp
# always set.seed before running harmony otherwise the results will change each time
# it doesn't have a seed argument included like Seurat functions
set.seed(123)
# just ignore the warning about the invalid name
seu <- RunHarmony(seu, group.by.vars = c("orig.ident"), 
                  reduction = "pca",reduction.save = "harmony")
## Harmony 1/10
## Harmony 2/10
## Harmony 3/10
## Harmony 4/10
## Harmony 5/10
## Harmony 6/10
## Harmony 7/10
## Harmony converged after 7 iterations
## Warning: Invalid name supplied, making object name syntactically valid. New
## object name is Seurat..ProjectDim.RNA.harmony; see ?make.names for more details
## on syntax validity
# we can run an ElbowPlot again to look at the harmony components
ElbowPlot(seu,reduction = "harmony",ndims = 50)

# looks similar to the PCA elbowplot so will go with 25 again

# let's compare the grouping on the harmony components and principal components
harmony1and2 <- DimPlot(seu,reduction="harmony",group.by = "orig.ident")
pca1and2 <- DimPlot(seu,reduction="pca",group.by = "orig.ident")
# plot them side by side
pca1and2 + harmony1and2

Comments

There seems to be better integration following harmony, particularly on the left side of the plot. As expected, there are still areas that are mostly PBMMC-2 but that is because that sample had more erythroid cells. This is actually a good indication that there hasn’t been “over-correction” of the batch effect

Run UMAP, neighbours, and clustering on the harmony components instead of PCA

Firstly, it is a good idea to save the Seurat object before running RunUMAP as the functions will overwrite the results from the unintegrated data. Here I am making a new Seurat object called seu_noIntegrate that contains the unintegrated results before proceeding

seu_noIntegrate <- seu
Run the main functions again
seu <- RunUMAP(seu, reduction = "harmony", dims = 1:25)
## Warning: The default method for RunUMAP has changed from calling Python UMAP via reticulate to the R-native UWOT using the cosine metric
## To use Python UMAP via reticulate, set umap.method to 'umap-learn' and metric to 'correlation'
## This message will be shown once per session
seu <- FindNeighbors(object = seu, reduction = "harmony",dims = 1:25)
seu <- FindClusters(seu, resolution = seq(0.1, 0.8, by=0.1))

Do not run this code block

This loads the clusters and integration results that I had generated previously. RunHarmony and RunUMAP can yield slightly different results when run repeatedly on the same data.

load("seu_clusters_integrated_cell_assign.RData")

Plotting the Harmony results

Plot the Harmony-integrated UMAP

Seurat::DimPlot(seu,reduction = "umap",
                group.by = "RNA_snn_res.0.3")

# Optionally save the graph as a png file
#ggsave("harmony_cluster_0.3.png",width=15,height=15,units="cm")

Compare it to the original UMAP split by sample

Seurat::DimPlot(seu_noIntegrate,reduction = "umap",
                group.by = "RNA_snn_res.0.3")

#ggsave("unintegrated_cluster_0.3.png",width=15,height=15,units="cm")

Plot the Harmony-integrated UMAP split by sample

Seurat::DimPlot(seu,reduction = "umap",
                split.by = c("orig.ident"),group.by = "RNA_snn_res.0.3")

#ggsave("harmony_cluster_0.3_splitby_sample.png",width=20,height=10,units="cm")

Compare it to the original UMAP split by sample

Seurat::DimPlot(seu_noIntegrate,reduction = "umap",
                split.by = c("orig.ident"),group.by = "RNA_snn_res.0.3")

#ggsave("unintegrated_cluster_0.3_splitby_sample.png",width=20,height=10,units="cm")

# Stacked barplot of the integrated clusters
# There are now fewer sample-specific clusters
ggplot(seu[[]], aes(fill=orig.ident,x=RNA_snn_res.0.3))+
  geom_bar(position="fill") +
  theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0.5)) + 
  labs(fill="Sample")

#ggsave("harmony_integrated_barplot_res0.3.png",width=10,height=10,units="cm")

Cell type assignment

#################################################################################

The first step in cell type assignment is to decide on the cluster resolution that you want to use. Once you have decided on the optimal cluster resolution, then you perform steps to identify the cells in each cluster. Depending on your dataset you might want to do a “coarse” cluster resolution where you just separate into cell types as well as a finer clustering into cell subtypes

# The lower cluster resolutions looked better on the original data
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.2","RNA_snn_res.0.3",
                                  "RNA_snn_res.0.4","RNA_snn_res.0.5"),label = T)

Some of the clusterings are very similar and your clusters may look slightly different to what is here. I am going to use 0.3 for now (you can always change your mind later). Let’s set our active.ident as the 0.3 clustering

# the lower resolution at 0.3 looks reasonably good - not too many overlapping clusters
seu <- Seurat::SetIdent(seu, value = seu$RNA_snn_res.0.3)
# Let's plot the UMAP with labels
Seurat::DimPlot(seu, group.by = c("RNA_snn_res.0.3"),label = T)

Plotting cell marker genes

We have talked a lot about erythroid cells so let’s plot some markers of other cell types instead. We are saving the marker genes as vectors. Feel free to add marker genes for other cell types

# Define our cell marker lists
tcell_genes <- c("IL7R", "LTB", "TRAC", "CD3D")
monocyte_genes <- c("CD14", "CST3", "CD68", "CTSS")

Now let’s do a feature plot with T cell genes

ncol controls how many columns there are in the output when you are making multiple plots at the same time. By default, the normalised counts are plotted

Seurat::FeaturePlot(seu, features=tcell_genes, ncol=2)

# Looks like clusters 0 and 10 are highest for the T cell markers on the UMAP
# Let's confirm that with a violin plot instead
Seurat::VlnPlot(seu,group.by = ("RNA_snn_res.0.3"),
                features = tcell_genes,
                ncol = 2)

A note on marker specificity and protein markers

Sometimes protein-level cell markers don’t come up strongly in RNA-seq data. Sometimes, commonly used cell markers aren’t actually as specific to a particular cell type as people would like them to be as well. If there is no universally recognised gold-standard marker for a cell type then it is always a good idea to plot multiple markers. In the plot above, LTB is up in multiple clusters but if you look across all genes then it is clear that the genes are only consistently higher in clusters 0 and 10.

Monocyte genes

# Now let's do the same for the monocyte genes starting with a UMAP
Seurat::FeaturePlot(seu, features=monocyte_genes, ncol=2)

# Let's check with a violin plot as well
Seurat::VlnPlot(seu,
                features = monocyte_genes,
                ncol = 2)

# Cluster 2 consistently highest with 7 higher for some markers

Note - Gene names can differ between species and can differ from protein name

Gene names can differ between species (eg mouse and human) so make sure that you are using the correct name. R may return an error saying gene not found when in reality you just aren’t using the right gene name for your target species. Similarly, publications will often refer to protein names which can often differ from the name of the gene that encodes the protein (eg NK1.1 comes from the Klrb1c gene in mice). NCBI gene is a good place to check to confirm that you are using the correct gene name https://www.ncbi.nlm.nih.gov/gene.

Visualising using AddModuleScore

An alternative approach to plotting multiple genes is to use AddModuleScore to make a combined score from a list of marker genes. The CellCycleScoring function that we used earlier actually uses the AddModuleScore function to summarise the expression of the cell cycle marker genes. Here we will use it for the T cell and monocyte marker genes

# Note that AddModuleScore expects a list. If you just put in tcell_genes instead
# of list(tcell_genes) then it would run AddModuleScore on each gene individually
seu <- Seurat::AddModuleScore(seu,
                                  features = list(tcell_genes),
                                  name = "tcell_genes")

Where has the result been stored? What is it called?

Result is stored in the metadata and for some reason it adds a 1 on to the end of the name that has been specified (ie tcell_genes1 instead of tcell_genes). Now let’s plot the results.

# UMAP of tcell_genes1
Seurat::FeaturePlot(seu, features="tcell_genes1")

# VlnPlot of tcell_genes1
Seurat::VlnPlot(seu, features="tcell_genes1")

What does the y-axis mean? Is there an alternative to AddModuleScore?

AddModuleScore is fast, simple, and produces similar results to available alternatives. One of the issues that I have with AddModuleScore is that it returns negative values which can be misinterpreted as downregulation. AddModuleScore also generates the score using the average expression of other randomly selected genes (number of genes is specified by the ctrl argument to AddModuleScore - default value is 100). This is fine most of the time but can be problematic if you subset your Seurat object and then rerun AddModuleScore again. There is an alternative package called UCell that produces a similar score is more reproducible and returns a score between 0 and 1. See appendix for details

Finding cell markers using differential expression

The FindAllMarkers function sequentially runs differential expression between each cluster and all other cells (ie cluster 0 vs cluster1-13 combined). The idea is to find the genes that are highest in each cluster compared to other clusters. A column by column explanation of the output object (de_genes) is in the appendix

de_genes <- Seurat::FindAllMarkers(seu,  min.pct = 0.25, ident="RNA_snn_res.0.3",
                                   only.pos = TRUE)
## Calculating cluster 0
## Calculating cluster 1
## Calculating cluster 2
## Calculating cluster 3
## Calculating cluster 4
## Calculating cluster 5
## Calculating cluster 6
## Calculating cluster 7
## Calculating cluster 8
## Calculating cluster 9
## Calculating cluster 10
## Calculating cluster 11
## Calculating cluster 12
# Let's examine the output
head(de_genes)
##        p_val avg_log2FC pct.1 pct.2 p_val_adj cluster   gene
## IFITM1     0   2.179555 0.766 0.193         0       0 IFITM1
## CD3D       0   2.162188 0.813 0.214         0       0   CD3D
## TRAC       0   2.119884 0.710 0.178         0       0   TRAC
## CD3E       0   1.939660 0.670 0.124         0       0   CD3E
## LTB        0   1.909197 0.824 0.375         0       0    LTB
## JUNB       0   1.860133 0.892 0.649         0       0   JUNB
# Now let's subset the object to only keep the genes below an adjusted p-value of 0.05
de_genes <- subset(de_genes, de_genes$p_val_adj < 0.05)

Are the T cell genes in the marker lists of clusters 0 and 10?

# Another way of subsetting using the %in% operator
de_genes[de_genes$gene %in% tcell_genes,]
##               p_val avg_log2FC pct.1 pct.2     p_val_adj cluster gene
## CD3D   0.000000e+00  2.1621879 0.813 0.214  0.000000e+00       0 CD3D
## TRAC   0.000000e+00  2.1198845 0.710 0.178  0.000000e+00       0 TRAC
## LTB    0.000000e+00  1.9091972 0.824 0.375  0.000000e+00       0  LTB
## IL7R  1.448832e-295  1.6810492 0.487 0.100 2.705405e-291       0 IL7R
## LTB1   5.390164e-35  1.1532332 0.671 0.465  1.006505e-30       6  LTB
## LTB2   7.063843e-24  0.8327548 0.793 0.468  1.319031e-19       9  LTB
## CD3D1  5.352733e-39  1.4531134 0.771 0.341  9.995157e-35      10 CD3D
## TRAC1  8.990548e-36  1.4475279 0.700 0.289  1.678805e-31      10 TRAC
## IL7R1  2.668579e-10  0.8085502 0.386 0.183  4.983038e-06      10 IL7R

All 4 markers came up for both cluster 0 and 10 ### Saving the output - Be careful with Excel It is generally a good idea to save the output of FindAllMarkers as you may want to run further analysis on the output (eg enrichment) and it can be time consuming to run. A safe way of saving the output is as an RData object but you may also want to save an excel file for showing your results to supervisors/collaborators. However, Excel has an unfixable habit of converting some gene names (eg MARCH1) to dates. If you save your result as a csv, open it in Excel, and then load it back into R again, you will have dates in your gene column.

write.csv(de_genes,
          file="de_genes_FindAllMarkers.csv",
          row.names = F, quote = F)

save(de_genes,file="de_genes_FindAllMarkers.RData")

Manual assignment summary and doublets

The manual assignment workflow involves choosing your cluster resolution, and then identifying each cluster based on the expression levels of known markers. This approach is reliable, but it can be subjective, slow, and does require some expertise about which cell markers are the best. If multiple cell type markers are present in a cluster then it can be an indicator of doublets, particularly if expression is inconsistent across cells. I have included instructions on how you can add manual cluster assignments to the Seurat metadata in the Appendix

Automatic cell type assignment

A quicker method of performing cell type assignment without as much biology knowledge is to use one of the many available tools for cell type assignment. This is an example using two Bioconductor packages

Install packages if necessary and load libraries

# BiocManager::install(c("celldex","SingleR"))
# Note that I have hidden the console output of loading the packages
library(celldex)
library(SingleR)

Choose a reference dataset from Celldex

singleR has multiple reference datasets. We will use a dataset of Hematopoietic cells

# load the dataset as ref
ref <- celldex::NovershternHematopoieticData()
## snapshotDate(): 2022-10-31
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
## see ?celldex and browseVignettes('celldex') for documentation
## loading from cache
class(ref)
## [1] "SummarizedExperiment"
## attr(,"package")
## [1] "SummarizedExperiment"
# show how many marker genes there are for each cell type
table(ref$label.main)
## 
##         B cells       Basophils    CD4+ T cells    CD8+ T cells            CMPs 
##              29               6              21              24               4 
## Dendritic cells     Eosinophils Erythroid cells            GMPs    Granulocytes 
##              10               5              33               4              13 
##            HSCs  Megakaryocytes            MEPs       Monocytes        NK cells 
##              14              12               9               9              14 
##      NK T cells 
##               4

Run singleR using the singleR function

SingleR compares our data to the reference to work out the likely annotations. Since SingleR is a Bioconductor package and we have a Seurat object rather than a SingleCellExperiment, we have to extract our normalised counts using the GetAssayData function (remember slot=data for normalised counts, slot=counts for unnormalised counts). The function returns a DFrame which is a Bioconductor dataframe rather than a regular dataframe. This means that we can use the regular SingleR functions on the results

seu_SingleR <- SingleR::SingleR(test = Seurat::GetAssayData(seu, slot = "data"),
                                    ref = ref,
                                    labels = ref$label.main)
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.
# Quick check of the output
head(seu_SingleR)
## DataFrame with 6 rows and 4 columns
##                                                    scores          labels
##                                                  <matrix>     <character>
## PBMMC-1_AAACCTGCAGACGCAA-1 0.216202:0.197296:0.086435:...         B cells
## PBMMC-1_AAACCTGTCATCACCC-1 0.143005:0.129582:0.170521:...    CD8+ T cells
## PBMMC-1_AAAGATGCATAAAGGT-1 0.113423:0.196264:0.111341:...       Monocytes
## PBMMC-1_AAAGCAAAGCAGCGTA-1 0.166749:0.168504:0.239303:...    CD4+ T cells
## PBMMC-1_AAAGCAACAATAACGA-1 0.102549:0.103979:0.178762:...    CD8+ T cells
## PBMMC-1_AAAGCAACATCAGTCA-1 0.181526:0.147693:0.115841:... Erythroid cells
##                            delta.next   pruned.labels
##                             <numeric>     <character>
## PBMMC-1_AAACCTGCAGACGCAA-1  0.1471065         B cells
## PBMMC-1_AAACCTGTCATCACCC-1  0.0753696    CD8+ T cells
## PBMMC-1_AAAGATGCATAAAGGT-1  0.1274524       Monocytes
## PBMMC-1_AAAGCAAAGCAGCGTA-1  0.0845267    CD4+ T cells
## PBMMC-1_AAAGCAACAATAACGA-1  0.0607683    CD8+ T cells
## PBMMC-1_AAAGCAACATCAGTCA-1  0.1057135 Erythroid cells

Plot the output

# Plot a heatmap using the SingleR plotting function
SingleR::plotScoreHeatmap(seu_SingleR)
## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

## Warning: useNames = NA is deprecated. Instead, specify either useNames = TRUE
## or useNames = TRUE.

Heatmap comments

Some cell types are clear but there are also a lot of overlaps (eg basophils and eosinophils). Some of the cell types look like there aren’t many cells in them at all.

# 

# Some clusters look very small, let's extract the SingleR annotation as a vector
# and then make a table of the results
SingleR_annot <- seu_SingleR$labels
t <- table(SingleR_annot)
t
## SingleR_annot
##         B cells       Basophils    CD4+ T cells    CD8+ T cells            CMPs 
##             775             180             912             706             116 
## Dendritic cells     Eosinophils Erythroid cells            GMPs    Granulocytes 
##             130              56            2206              14             494 
##            HSCs  Megakaryocytes            MEPs       Monocytes        NK cells 
##             283               1              26             907              13 
##      NK T cells 
##              11

We can’t do much with tiny clusters so we are going to arbitrarily remove the celltypes that have less than 10 cells

# Make a vector of names cell types with less than 10 cells and then change it
# to unassigned
other <- names(t)[t < 10]
SingleR_annot[SingleR_annot %in% other] <- "unassigned"

# And finally, let's add the labels into the metadata of the Seurat object
seu <- AddMetaData(seu, metadata= SingleR_annot,col.name="SingleR_annot")

Plot the SingleR labels and compare it to the original clustering

Note that SingleR didn’t actually use our clustering. It purely classified the cells into the groups

# plot the clusters and the annotation
DimPlot(seu,reduction = "umap",group.by = c("SingleR_annot","RNA_snn_res.0.3"),label = T,label.size = 3)

Comments

There is a lot of mixing within some clusters. It seems to have worked better at separating the T cell and erythroid cells between clusters 0 and 1

Stacked barplots of the SingleR results

# Cell types within each sample 
ggplot(seu[[]], aes(fill=SingleR_annot,x=orig.ident))+
  geom_bar(position="fill") +
  theme(axis.text.x=element_text(angle=90, hjust = 1, vjust = 0.5)) + 
  labs(fill="Cell type")

Barplot with the dittoseq package

Now let’s have a look at the breakdown of celltypes per cluster. This time we will use dittoseq as it is designed to choose more distinct colours

dittoSeq::dittoBarPlot(seu, var = "SingleR_annot", group.by = "RNA_snn_res.0.3")

# save the seurat object with clusters
save(seu,file="seu_clusters_integrated_cell_assign.RData")

#########################################################################################
#########################################################################################

Appendix

Leiden algorithm

The Leiden algorithm will run more efficiently with igraph installed as well. Install leidenalg using either;

# pip install leidenalg igraph
# or
# conda install -c conda-forge leidenalg python-igraph

Running the Leiden algorithm

The Leiden algorithm can be specified using algorithm=4. The more efficient “igraph” method can be specified using the method argument

seu <- Seurat::FindClusters(seu, resolution = seq(0.1, 0.8, by=0.1),
                            algorithm = 4,method = "igraph")

UCell

Install UCell from Bioconductor and then load the library.

# BiocManager::install("UCell")
library(UCell)
seu <- AddModuleScore_UCell(seu, features = list(tcell_genes=tcell_genes))

# look at the names of metadata
names(seu[[]])
##  [1] "orig.ident"        "nCount_RNA"        "nFeature_RNA"     
##  [4] "percent.mito"      "percent.ribo"      "percent.globin"   
##  [7] "ident"             "sum"               "detected"         
## [10] "percent.top_50"    "percent.top_100"   "percent.top_200"  
## [13] "percent.top_500"   "total"             "S.Score"          
## [16] "G2M.Score"         "Phase"             "RNA_snn_res.0.1"  
## [19] "RNA_snn_res.0.2"   "RNA_snn_res.0.3"   "RNA_snn_res.0.4"  
## [22] "RNA_snn_res.0.5"   "RNA_snn_res.0.6"   "RNA_snn_res.0.7"  
## [25] "RNA_snn_res.0.8"   "seurat_clusters"   "tcell_genes1"     
## [28] "SingleR_annot"     "tcell_genes_UCell"
# the UCell score is the name of the list with _UCell on the end

# UMAP of tcell_genes1
Seurat::FeaturePlot(seu, features="tcell_genes_UCell")

# VlnPlot of tcell_genes1
Seurat::VlnPlot(seu, features="tcell_genes_UCell")

The message from the violin plot is the same as AddModuleScore, but I like that it returns a score between 0 and 1

FindAllMarkers output

  • A – the row name is the gene name
  • B – p_val = unadjusted p-value
  • C – avg_log2FC – average fold change between the gene in the cluster of interest compared to others
  • D – pct.1 = percentage of cells in the cluster of interest where the gene was detected
  • E – pct.2 = percentage of cells in all other clusters where the gene was detected
  • F – p_val_adj = Bonferroni adjusted p-value
  • F – cluster = the cluster that the gene is a marker for (genes can be markers for multiple clusters)
  • F – gene = self-explanatory ### plot a heatmap of the top 10 markers for each cluster Heatmaps are useful for visualising the expression of marker genes across clusters. They can highlight which clusters have similar expression patterns. Here I am ranking based on fold change as adjusted p-values can be the same for multiple genes
de_genes_top10 <- de_genes %>% group_by(cluster) %>% top_n(n = 10, wt = avg_log2FC)

# scale before running the heatmap RNA_snn_res.0.3
library(viridis)
## Loading required package: viridisLite
# Heatmaps will only work on the genes that are in the scale.data slot
seu <- ScaleData(seu,features=de_genes_top10$gene)
## Centering and scaling data matrix
# I've removed the legend with NoLegend and taken away the y-axis as otherwise
# there would be over 100 gene names down the side
DoHeatmap(seu, features=de_genes_top10$gene, group.by="RNA_snn_res.0.3", label=T) +
  scale_fill_gradientn(colors=viridis(256)) + NoLegend() + theme(axis.text.y = element_blank())
## Scale for fill is already present.
## Adding another scale for fill, which will replace the existing scale.

Heatmap comments

Some clusters have very similar expression - eg 4 and 5. There are also some cells in cluster 1 that seem to have more similar expression to cells in cluster 0. This might be due to us only plotting the top 10 genes from each cluster

Adding manual cell annotation

# Initialise a column
seu$manual_celltype <- rep("other")

# Assign a value to the manual celltype column based on the cluster
seu$manual_celltype[seu$RNA_snn_res.0.3==0] <- "T_cell"
seu$manual_celltype[seu$RNA_snn_res.0.3==10] <- "T_cell"
seu$manual_celltype[seu$RNA_snn_res.0.3==1] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==5] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==3] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==11] <- "Erythroid"
seu$manual_celltype[seu$RNA_snn_res.0.3==2] <- "Monocyte"

DimPlot(seu,reduction = "umap",group.by = c("SingleR_annot","manual_celltype"),label = T,ncol=1,label.size = 3)